The Missing Indicator Method: From Low to High Dimensions
Missing data is common in applied data science, particularly for tabular data
sets found in healthcare, social sciences, and natural sciences. Most
supervised learning methods only work on complete data, thus requiring
preprocessing such as missing value imputation to work on incomplete data sets.
However, imputation alone does not encode useful information about the missing
values themselves. For data sets with informative missing patterns, the Missing
Indicator Method (MIM), which adds indicator variables to indicate the missing
pattern, can be used in conjunction with imputation to improve model
performance. While commonly used in data science, MIM is surprisingly
understudied from an empirical and especially theoretical perspective. In this
paper, we show empirically and theoretically that MIM improves performance for
informative missing values, and we prove that MIM does not hurt linear models
asymptotically for uninformative missing values. Additionally, we find that for
high-dimensional data sets with many uninformative indicators, MIM can induce
model overfitting and thus degrade test performance. To address this issue, we
introduce Selective MIM (SMIM), a novel MIM extension that adds missing
indicators only for features that have informative missing patterns. We show
empirically that SMIM performs at least as well as MIM in general, and improves
MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on
real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM
on clinical tasks generated from the MIMIC-III database of electronic health
records.
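As a rough illustration of the basic idea described in this abstract (imputation combined with one missing indicator per feature), the following is a minimal sketch in plain Python. It is not the authors' implementation and does not cover SMIM; the function name `mean_impute_with_indicators`, the choice of mean imputation, and the toy data are assumptions made for the example.

```python
import math


def mean_impute_with_indicators(rows):
    """Mean-impute NaN entries column by column and append one 0/1
    missing indicator per original column (hypothetical helper)."""
    n_cols = len(rows[0])

    # Column means are computed from observed (non-NaN) entries only.
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    augmented = []
    for r in rows:
        imputed = [means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        indicators = [1.0 if math.isnan(v) else 0.0 for v in r]
        augmented.append(imputed + indicators)
    return augmented


nan = float("nan")
X = [[1.0, nan], [3.0, 4.0], [nan, 8.0]]  # toy data, two features
X_mim = mean_impute_with_indicators(X)
for row in X_mim:
    print(row)
```

Each output row holds the imputed feature values followed by the indicator columns, so a downstream model can use the missingness pattern itself as a signal. In practice, scikit-learn's `SimpleImputer` offers an `add_indicator=True` option that serves a similar purpose.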
Interpretable Survival Analysis for Heart Failure Risk Prediction
Survival analysis, or time-to-event analysis, is an important and widespread
problem in healthcare research. Medical research has traditionally relied on
Cox models for survival analysis, due to their simplicity and interpretability.
Cox models assume a log-linear hazard function as well as proportional hazards
over time, and can perform poorly when these assumptions fail. Newer survival
models based on machine learning avoid these assumptions and offer improved
accuracy, yet sometimes at the expense of model interpretability, which is
vital for clinical use. We propose a novel survival analysis pipeline that is
both interpretable and competitive with state-of-the-art survival models.
Specifically, we use an improved version of survival stacking to transform a
survival analysis problem to a classification problem, ControlBurn to perform
feature selection, and Explainable Boosting Machines to generate interpretable
predictions. To evaluate our pipeline, we predict risk of heart failure using a
large-scale EHR database. Our pipeline achieves state-of-the-art performance
and provides interesting and novel insights about risk factors for heart
failure.
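The step of turning a survival problem into a classification problem can be sketched as follows. This shows only the basic form of survival stacking on discrete times, not the improved version the abstract refers to, and it omits ControlBurn and Explainable Boosting Machines entirely. The function name `stack_survival`, the record layout `(features, time, event)`, and the toy data are assumptions for illustration.

```python
def stack_survival(subjects):
    """Expand (features, time, event) records into person-period rows.

    For every distinct event time t, each subject still at risk
    (observed time >= t) contributes one row labeled 1 if that subject's
    event occurred at t, else 0 (hypothetical helper).
    """
    event_times = sorted({time for _, time, event in subjects if event == 1})
    rows = []
    for features, time, event in subjects:
        for t in event_times:
            if time < t:
                break  # subject has already left the risk set
            label = 1 if (event == 1 and time == t) else 0
            rows.append((features, t, label))
    return rows


subjects = [
    ([0.5], 2, 1),  # event observed at time 2
    ([1.5], 3, 0),  # censored at time 3
    ([2.5], 3, 1),  # event observed at time 3
]
stacked = stack_survival(subjects)
for row in stacked:
    print(row)
```

The stacked rows, with the time value included as a feature, can then be passed to any binary classifier, whose predicted probability approximates the discrete-time hazard for a subject at that time.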
Swift X-Ray Observations of Classical Novae. II. The Super Soft Source sample
The Swift GRB satellite is an excellent facility for studying novae. Its
rapid response time and sensitive X-ray detector provide an unparalleled
opportunity to investigate the previously poorly sampled evolution of novae in
the X-ray regime. This paper presents Swift observations of 52
Galactic/Magellanic Cloud novae. We included the XRT (0.3-10 keV) X-ray
instrument count rates and the UVOT (1700-8000 Angstroms) filter photometry.
Also included in the analysis are the publicly available pointed observations
of 10 additional novae from the X-ray archives. This is the largest X-ray sample of
Galactic/Magellanic Cloud novae yet assembled and consists of 26 novae with
super soft X-ray emission, 19 from Swift observations. The data set shows that
the faster novae have an early hard X-ray phase that is usually missing in
slower novae. The Super Soft X-ray phase occurs earlier and does not last as
long in fast novae compared to slower novae. All the Swift novae with
sufficient observations show that novae are highly variable with rapid
variability and different periodicities. In the majority of cases, nuclear
burning ceases less than 3 years after the outburst begins. Previous
relationships, such as the nuclear burning duration vs. t_2 or the expansion
velocity of the ejecta and nuclear burning duration vs. the orbital period, are
shown to be poorly correlated with the full sample indicating that additional
factors beyond the white dwarf mass and binary separation play important roles
in the evolution of a nova outburst. Finally, we confirm two optical phenomena
that are correlated with strong, soft X-ray emission which can be used to
further increase the efficiency of X-ray campaigns.
Comment: Accepted to ApJ Supplements. Full data for Table 2 and Figure 17
available in the electronic edition. New version of the previously posted
paper since the earlier version was all set in landscape mode.
The Fourteenth Data Release of the Sloan Digital Sky Survey: First Spectroscopic Data from the extended Baryon Oscillation Spectroscopic Survey and from the second phase of the Apache Point Observatory Galactic Evolution Experiment
The fourth generation of the Sloan Digital Sky Survey (SDSS-IV) has been in
operation since July 2014. This paper describes the second data release from
this phase, and the fourteenth from SDSS overall (making this Data Release
Fourteen or DR14). This release makes public data taken by SDSS-IV in its first
two years of operation (July 2014-2016). Like all previous SDSS releases, DR14
is cumulative, including the most recent reductions and calibrations of all
data taken by SDSS since the first phase began operations in 2000. New in DR14
is the first public release of data from the extended Baryon Oscillation
Spectroscopic Survey (eBOSS); the first data from the second phase of the
Apache Point Observatory (APO) Galactic Evolution Experiment (APOGEE-2),
including stellar parameter estimates from an innovative data driven machine
learning algorithm known as "The Cannon"; and almost twice as many data cubes
from the Mapping Nearby Galaxies at APO (MaNGA) survey as were in the previous
release (N = 2812 in total). This paper describes the location and format of
the publicly available data from SDSS-IV surveys. We provide references to the
important technical papers describing how these data have been taken (both
targeting and observation details) and processed for scientific use. The SDSS
website (www.sdss.org) has been updated for this release, and provides links to
data downloads, as well as tutorials and examples of data use. SDSS-IV is
planning to continue to collect astronomical data until 2020, and will be
followed by SDSS-V.
Comment: SDSS-IV collaboration alphabetical author data release paper. DR14
happened on 31st July 2017. 19 pages, 5 figures. Accepted by ApJS on 28th Nov
2017 (this is the "post-print" and "post-proofs" version; minor corrections
only from v1, and most of the errors found in proofs corrected).
Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe
We describe the Sloan Digital Sky Survey IV (SDSS-IV), a project encompassing three major spectroscopic programs. The Apache Point Observatory Galactic Evolution Experiment 2 (APOGEE-2) is observing hundreds of thousands of Milky Way stars at high resolution and high signal-to-noise ratios in the near-infrared. The Mapping Nearby Galaxies at Apache Point Observatory (MaNGA) survey is obtaining spatially resolved spectroscopy for thousands of nearby galaxies (median z ~ 0.03). The extended Baryon Oscillation Spectroscopic Survey (eBOSS) is mapping the galaxy, quasar, and neutral gas distributions between z ~ 0.6 and 3.5 to constrain cosmology using baryon acoustic oscillations, redshift space distortions, and the shape of the power spectrum. Within eBOSS, we are conducting two major subprograms: the SPectroscopic IDentification of eROSITA Sources (SPIDERS), investigating X-ray AGNs and galaxies in X-ray clusters, and the Time Domain Spectroscopic Survey (TDSS), obtaining spectra of variable sources. All programs use the 2.5 m Sloan Foundation Telescope at the Apache Point Observatory; observations there began in Summer 2014. APOGEE-2 also operates a second near-infrared spectrograph at the 2.5 m du Pont Telescope at Las Campanas Observatory, with observations beginning in early 2017. Observations at both facilities are scheduled to continue through 2020. In keeping with previous SDSS policy, SDSS-IV provides regularly scheduled public data releases; the first one, Data Release 13, was made available in 2016 July.
The Fourteenth Data Release of the Sloan Digital Sky Survey: First Spectroscopic Data from the Extended Baryon Oscillation Spectroscopic Survey and from the Second Phase of the Apache Point Observatory Galactic Evolution Experiment
The fourth generation of the Sloan Digital Sky Survey (SDSS-IV) has been in operation since 2014 July. This paper describes the second data release from this phase, and the 14th from SDSS overall (making this Data Release Fourteen or DR14). This release makes the data taken by SDSS-IV in its first two years of operation (2014–2016 July) public. Like all previous SDSS releases, DR14 is cumulative, including the most recent reductions and calibrations of all data taken by SDSS since the first phase began operations in 2000. New in DR14 is the first public release of data from the extended Baryon Oscillation Spectroscopic Survey; the first data from the second phase of the Apache Point Observatory (APO) Galactic Evolution Experiment (APOGEE-2), including stellar parameter estimates from an innovative data-driven machine-learning algorithm known as "The Cannon"; and almost twice as many data cubes from the Mapping Nearby Galaxies at APO (MaNGA) survey as were in the previous release (N = 2812 in total). This paper describes the location and format of the publicly available data from the SDSS-IV surveys. We provide references to the important technical papers describing how these data have been taken (both targeting and observation details) and processed for scientific use. The SDSS web site (www.sdss.org) has been updated for this release and provides links to data downloads, as well as tutorials and examples of data use. SDSS-IV is planning to continue to collect astronomical data until 2020 and will be followed by SDSS-V
Sloan Digital Sky Survey IV: mapping the Milky Way, nearby galaxies, and the distant universe
We describe the Sloan Digital Sky Survey IV (SDSS-IV), a project encompassing three major spectroscopic programs. The Apache Point Observatory Galactic Evolution Experiment 2 (APOGEE-2) is observing hundreds of thousands of Milky Way stars at high resolution and high signal-to-noise ratios in the near-infrared. The Mapping Nearby Galaxies at Apache Point Observatory (MaNGA) survey is obtaining spatially resolved spectroscopy for thousands of nearby galaxies (median z ~ 0.03). The extended Baryon Oscillation Spectroscopic Survey (eBOSS) is mapping the galaxy, quasar, and neutral gas distributions between z ~ 0.6 and 3.5 to constrain cosmology using baryon acoustic oscillations, redshift space distortions, and the shape of the power spectrum. Within eBOSS, we are conducting two major subprograms: the SPectroscopic IDentification of eROSITA Sources (SPIDERS), investigating X-ray AGNs and galaxies in X-ray clusters, and the Time Domain Spectroscopic Survey (TDSS), obtaining spectra of variable sources. All programs use the 2.5 m Sloan Foundation Telescope at the Apache Point Observatory; observations there began in Summer 2014. APOGEE-2 also operates a second near-infrared spectrograph at the 2.5 m du Pont Telescope at Las Campanas Observatory, with observations beginning in early 2017. Observations at both facilities are scheduled to continue through 2020. In keeping with previous SDSS policy, SDSS-IV provides regularly scheduled public data releases; the first one, Data Release 13, was made available in 2016 July
Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe
We describe the Sloan Digital Sky Survey IV (SDSS-IV), a project encompassing
three major spectroscopic programs. The Apache Point Observatory Galactic
Evolution Experiment 2 (APOGEE-2) is observing hundreds of thousands of Milky
Way stars at high resolution and high signal-to-noise ratio in the
near-infrared. The Mapping Nearby Galaxies at Apache Point Observatory (MaNGA)
survey is obtaining spatially-resolved spectroscopy for thousands of nearby
galaxies (median redshift of z = 0.03). The extended Baryon Oscillation
Spectroscopic Survey (eBOSS) is mapping the galaxy, quasar, and neutral gas
distributions between redshifts z = 0.6 and 3.5 to constrain cosmology using
baryon acoustic oscillations, redshift space distortions, and the shape of the
power spectrum. Within eBOSS, we are conducting two major subprograms: the
SPectroscopic IDentification of eROSITA Sources (SPIDERS), investigating X-ray
AGN and galaxies in X-ray clusters, and the Time Domain Spectroscopic Survey
(TDSS), obtaining spectra of variable sources. All programs use the 2.5-meter
Sloan Foundation Telescope at Apache Point Observatory; observations there
began in Summer 2014. APOGEE-2 also operates a second near-infrared
spectrograph at the 2.5-meter du Pont Telescope at Las Campanas Observatory,
with observations beginning in early 2017. Observations at both facilities are
scheduled to continue through 2020. In keeping with previous SDSS policy,
SDSS-IV provides regularly scheduled public data releases; the first one, Data
Release 13, was made available in July 2016